Is there a humanist way of working with data (cf. Posner, “Data Trouble”)? This workshop prods at that question, examining tensions with quantification and categorization, and moving towards (we hope) repair and remediation. A hands-on portion will look at concrete ways to represent research topics as tabular data (aka a spreadsheet), one of the handiest and most portable of data formats.
This workshop introduces the challenges and benefits of translating humanities sources into machine-accessible formats. It also provides a gentle introduction to programming in R using the RStudio environment.
Take a look around: you’re in RStudio, an environment that makes coding in R more transparent. Your window is divided into four segments. In the upper right, you’ll see the environment: this displays all the objects you have loaded into memory (currently none). In the upper left, you’ll see the script editor: this is the place to work on code that you’re currently writing (or borrowing from elsewhere!). To run code in code chunks (the grey chunks), you can either press Ctrl+Enter to run single lines or click the green arrow to run the entire chunk. In the lower left, you’ll see the console: this is where code actually executes and where the output prints. In the lower right, you’ll see a few different things. The “Files” tab shows whatever files live in the directory (or folder) that R is currently working in; if you run any plots, they’ll show up in the “Plots” tab; you can also get help in the “Help” tab.
To start, if you’re using your own computer, run the section of setup code above by clicking the green arrow in the upper right of the grey box.
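To see the panes in action, try running a single line yourself. A minimal example (the object name here is just an illustration, not part of the setup code):

```r
# Running this line (Ctrl+Enter) creates an object named `greeting`,
# which will then appear in the Environment pane (upper right)
greeting <- "hello, workshop"

# Printing it sends output to the console (lower left)
print(greeting)
```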
In her talk “Data Trouble,” Posner enumerates some of the reasons behind the data negativity in some humanities circles. Those reasons go beyond a fondness for close reading and an aversion to positivism; they fall under the sign of categorization. For example:
Demarcation – the separation of research observations into discrete items in order to perform operations on them.

As an example, Posner discusses the culturomics authors’ paper, specifically its visualization of inventions and their adoption rates.

A figure from “Quantitative Analysis of Culture…”

In the above figure, the authors see proof that more recent cohorts of inventions were adopted more rapidly than earlier ones. The visualization suggests that each invention cohort appeared like a bolt out of the blue. Humanists, on the other hand, tend to see inventions emerging from constellations of forces, with each new technology integrating those that came before. Not separate, in other words. And yet, separation is a necessary precursor to data manipulation and analysis. The computer can’t do much with an undifferentiated mass of information.
Parameterization – the use of a standard scale of measurement to facilitate comparisons.
Some things lend themselves well to parameterization: word counts, canvas size, the length of a song, or the shape of a pot. But within cultural artifacts, many features that are legible to human perception are difficult to parameterize, e.g. tone of voice, posture, facial expression. And some things that seem simple, e.g. how far it is from London to Edinburgh, change considerably depending on the historical period, mode of transport, class status, gender, weather, and so on.
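To make the easy case concrete, word counts are simple to parameterize in R. A minimal sketch, using two invented snippets of text (the sentences and object names are illustrative only):

```r
# Two invented snippets of text to compare by word count
texts <- c(
  doc_a = "Call me Ishmael.",
  doc_b = "It was the best of times, it was the worst of times."
)

# Split each string on whitespace and count the resulting words
word_counts <- sapply(strsplit(texts, "\\s+"), length)
word_counts
```

A number like this is directly comparable across documents; tone of voice or posture offers no such obvious scale.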
Ontological stability – how we organize entities to make a data model.
Ontologies get at the way in which one sees the world. To humanists, perspective changes things.
A runaway slave ad compared to its representation in the NJ Slavery database
Replicability – a benefit of well-structured, well-described data is reuse and replicability, but is that a goal in the humanities?
As Posner suggests, the interpretation of the data IS the work, and it would be foreign to most humanists to suggest that any other humanist would reach the same conclusions.
Boundedness – all historical records are partial; can any historical dataset, therefore, be representative?
We have a fairly limited vocabulary for referencing gaps or absences in our data, but there are some strategies.
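One concrete strategy in R is to mark values the source does not supply with `NA`, rather than an empty string or a guess, so that absence stays visible in analysis. A minimal sketch with invented records (the names and columns are hypothetical):

```r
# Invented subscriber records; the source gives no gender for the third
subscribers <- data.frame(
  name   = c("J. Smith", "M. Brown", "A. Lee"),
  gender = c("male", "female", NA)
)

# useNA = "ifany" makes the gap explicit in the tabulation
# instead of silently dropping it
table(subscribers$gender, useNA = "ifany")
```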
Deracination – when creating data, some context is inevitably removed.
It’s not possible to capture every aspect of a given source. But fortunately, that is okay. No humanist would assert that a digitized book is the same as the print book used as its source. The two serve different purposes, and have different affordances and drawbacks.
A letter from the Rutgers College War Service Bureau Records and its representations as TEI-XML and HTML
Some archival sources can be readily understood as data. Take, for example, this antebellum American newspaper’s subscription book. This is already a technology for organizing and storing data; we simply want to transcribe it so that we can represent and analyze it more effectively. Depending on one’s research, one may want to quantify subscribers by gender, map their locations, or calculate turnover.
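A minimal sketch of that workflow in R, with invented rows standing in for the subscription book (the names, towns, and dates are hypothetical):

```r
# Hypothetical transcription of a few subscription-book entries
subs <- data.frame(
  subscriber = c("E. Hart", "W. Cole", "S. Mason", "R. Price"),
  gender     = c("female", "male", "male", "female"),
  town       = c("Trenton", "Newark", "Trenton", "Camden"),
  start_year = c(1848, 1849, 1848, 1850),
  end_year   = c(1852, 1850, 1853, 1853)
)

# Quantify subscribers by gender
table(subs$gender)

# How long did each subscription last? (a rough proxy for turnover)
subs$duration <- subs$end_year - subs$start_year
mean(subs$duration)
```

Mapping the towns would require joining these rows to coordinates, but even this small table already supports the counting and turnover questions above.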